
(IJCAI 2018) A Multi-task Learning Approach for Image Captioning

Keyword [Multi-task Learning]

Zhao W, Wang B, Ye J, et al. A Multi-task Learning Approach for Image Captioning[C]//IJCAI. 2018: 1205-1211.



1. Overview


This paper proposes a Multi-task Learning Approach for Image Captioning (MLAIC), which jointly trains three models:

  • multi-object classification model. regularizes the CNN encoder
  • syntax generation model
  • image captioning model. benefits from the object categorization and syntax knowledge



2. Architecture




2.1. Ground-Truth

2.1.1. Object Vector



  • 1 indicates the category is present in the image, 0 otherwise
  • C. the number of object categories
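
A minimal PyTorch sketch of building this multi-hot ground-truth vector (the function name and the example category indices are illustrative):

```python
import torch

def object_vector(present_categories, num_categories):
    # Multi-hot ground truth: entry c is 1 if object category c is present
    # in the image, 0 otherwise.
    v = torch.zeros(num_categories)
    v[list(present_categories)] = 1.0
    return v

# e.g. categories 3 and 17 present out of C = 80 categories
y_obj = object_vector([3, 17], num_categories=80)
```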

2.1.2. Image Description



  • T. sentence length

2.1.3. Combinatory Categorial Grammar (CCG)



2.2. Shared CNN Encoder

  • encodes the image into L feature vectors (H×W×C → L×D)
  • L = 14×14 = 196, D = 2048


  • the shared CNN encoder is fine-tuned by both the captioning and classification tasks
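
A minimal sketch of such an encoder, assuming a torchvision ResNet-101 backbone (the notes only record the output shape, not the exact backbone); it yields the 14×14 grid of 2048-d vectors and the mean image feature later fed to LSTM_1:

```python
import torch
import torch.nn as nn
from torchvision import models

# Assumption: ResNet-101 backbone; the notes only specify L x D with
# L = 14 * 14 = 196 and D = 2048.
backbone = models.resnet101(weights=None)
encoder = nn.Sequential(*list(backbone.children())[:-2])  # drop avgpool + fc

images = torch.randn(2, 3, 448, 448)      # 448 / 32 = 14 -> a 14x14 feature grid
feats = encoder(images)                   # (B, 2048, 14, 14)
feats = feats.flatten(2).transpose(1, 2)  # (B, 196, 2048): L vectors of dim D
z = feats.mean(dim=1)                     # mean image feature used by LSTM_1
```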

2.2.1. Classification



2.3. Shared LSTM Decoder

  • LSTM_1. top-down visual attention model
  • LSTM_2. language model

2.3.1. LSTM_1

  • Input.


  1. z. image mean feature
  2. e^w. previously generated word embedding
  3. e_s. previously generated syntax embedding
  4. h^(2). previous output of LSTM_2
  • Output.
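
A sketch of one step of LSTM_1, assuming the four inputs above are simply concatenated; the sizes come from Section 3.1 and the code is illustrative rather than the authors' implementation:

```python
import torch
import torch.nn as nn

D, E_W, E_S, H1, H2 = 2048, 512, 100, 1000, 512  # sizes from Section 3.1

lstm1 = nn.LSTMCell(D + E_W + E_S + H2, H1)

def lstm1_step(z, e_w, e_s, h2_prev, state1):
    # Top-down attention LSTM: input is the concatenation of the mean image
    # feature, previous word embedding, previous CCG supertag embedding and
    # the previous hidden state of LSTM_2.
    x = torch.cat([z, e_w, e_s, h2_prev], dim=-1)
    h1, c1 = lstm1(x, state1)
    return h1, (h1, c1)
```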


2.3.2. LSTM_2

  • Input


    sigma. feed-forward NN
  • Output
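
A matching sketch of LSTM_2, assuming its input is the attended image feature together with LSTM_1's hidden state, and σ is realized as linear output heads over words and CCG supertags (vocabulary sizes are placeholders):

```python
import torch
import torch.nn as nn

D, H1, H2 = 2048, 1000, 512
V_WORD, V_CCG = 10000, 500            # placeholder vocabulary sizes

lstm2 = nn.LSTMCell(D + H1, H2)
word_head = nn.Linear(H2, V_WORD)     # sigma-style feed-forward head for words
ccg_head = nn.Linear(H2, V_CCG)       # analogous head for CCG supertags

def lstm2_step(attended_feat, h1, state2):
    # Language LSTM: consumes the attended image feature and LSTM_1's hidden
    # state, then predicts distributions over the next word / supertag.
    h2, c2 = lstm2(torch.cat([attended_feat, h1], dim=-1), state2)
    return word_head(h2), ccg_head(h2), (h2, c2)
```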


2.4. Multi-Task Learning

  • classification. multi-label margin loss



  • other two tasks. negative log-likelihood (NLL)



  • overall. weighted sum of the three task losses with weights λ1, λ2, λ3
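
A hedged sketch of the combined objective, using PyTorch's MultiLabelMarginLoss for classification and NLL for the two generation tasks; which λ belongs to which task is not recorded in these notes, so the ordering below is an assumption:

```python
import torch.nn as nn

margin_loss = nn.MultiLabelMarginLoss()  # classification: multi-label margin loss
nll = nn.NLLLoss()                       # captioning / CCG generation: NLL over log-probs

def total_loss(cls_scores, cls_targets,      # cls_targets: class indices padded with -1
               word_log_probs, word_targets,
               ccg_log_probs, ccg_targets,
               lambdas=(0.2, 0.7, 0.1)):
    # Weighted sum of the three task losses; the lambda-to-task mapping here
    # is an assumption made for illustration.
    lam1, lam2, lam3 = lambdas
    return (lam1 * margin_loss(cls_scores, cls_targets)
            + lam2 * nll(word_log_probs, word_targets)
            + lam3 * nll(ccg_log_probs, ccg_targets))
```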




3. Experiments


3.1. Details

  • λ1 = 0.2, λ2 = 0.7, λ3 = 0.1
  • LSTM_1 = 1000 unit, LSTM_2 = 512 unit
  • word embedding = 512, CCG supertag embedding = 100
  • beam size = 5
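
The same settings collected into a single config dict for quick reference:

```python
config = {
    "lambda1": 0.2, "lambda2": 0.7, "lambda3": 0.1,  # task loss weights
    "lstm1_units": 1000,
    "lstm2_units": 512,
    "word_embedding_dim": 512,
    "ccg_supertag_embedding_dim": 100,
    "beam_size": 5,
}
```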

3.2. Comparison



3.3. Ablation